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Abstract 

We consider the problem of minimizing an objective function that depends on an orthonormal matrix. 
This situation is encountered when looking for common principal components, for example, and the 
Flury method is a popular approach. However, the Flury method is not effective for higher dimensional 
problems. We obtain several simple majorization-minizmation (MM) algorithms that provide solutions 
to this problem and are effective in higher dimensions. We then use simulated data to compare them 
with other approaches in terms of convergence and computational time. 



1 Introduction 

The minimization of the objective function 



/(D) = trjW^A^D'} 

9=1 



(1) 



is required for a potpourri of statistical problems. To minimize an objective function /(D) that depends on 
a orthonormal matrix D (such as an eigenvector matrix), the search space is the orthogonal Stiefel manifold. 
This manifold is an embedded sub- manifold of R pxp equal to the set of all orthonormal matrices A4 = 
{D € W xp : D'D = I p }, where I p denotes the p-dimensional identity matrix. The matrices Wi, . . . , Wq 
are positive-definite and are usually sample covariance matrices. The matrices Ai,...,Ag are diagonal 
matri ces with positive elements. 

In lFlurv and Gautschi ( 1984 ) a com mon principal components model for G groups is found by minimiz- 



((T|). SchottI ( 1998 ) and Boik ( 2003) use this o bject ive function to find common principal components 



ing 

on correlation matrices. iMerbouha and Mkhadril (]2004[ ) use this objective function as a regularized tech- 
nique in discriminant ana lysis with mixed discrete and continuous variables for generalized location models. 
Yang and Shahabil ([20061 ) use this decomposit ion for multivariate t i me se ries data sets and use it in various 
multimedia, med ical and nancial applic ations. ICeleux and Govaertl ([19951 ) give an expectation- maxim i zation 
(EM) algorithm ([Dempster et all 119771 ) wherein each M-step the minimization (Tl]) is preformed. iBoikl (|2007l) 
note that the common principal com ponents model has been employed in many fields such as the genetics, cli- 
matology, ontogeny, and other fields (Arnold and Phillips!. 1999 ; Klingenberg et al. . ll996 ; Kulkarni and Racl . 
20001: iKrzanowskil 119901 ISengupta and Bovlel 119981 lOksanen and Huttunenl . Il989j ). 

The Fl ury-Gautschi (FG ) algorithm ([Flury and Gautschil Il986l ) is the most popular algorithm to min- 
imize ([!}. iLefkomtch (12004 ) report that the FG "is computationally expensive, especially for large and/or 



many matric es." Boikl (20071) agree, pointing out that the Flury-Gautschi algorithm is slow in higher (p > 5) 



dimensions. Browne and McNicholas] (j2012t) also show that the FG al gorithm is slow in high dimensions. 
This limitation in application of the FG algorithm has had knock-on effects in methods that use it. For 
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example, in a high dimensional mixture modelling application, iBouvevron et al.l (|2007| ) avoid the common 
principal component models stating that they "require a complex iterative estimation based on the FG al- 



gorithm (jFlurv and Gautschl 1 1986T) and therefore they will be not considered here." To overcome a slow 



algorithm in high dimensions, Browne and McNicholasI ( 2012 ) implemented an accelerated line search (ALS) 
for optimization on the orthogonal Stiefel manifold in a mixture modelling application and showed that this 
outperforms the FG method in h igh dimensions and reduces the number of degenerate solutions produced by 
the EM algorithm. In their ALS, Browne and McNicholasI ( 2012 ) do not exploit the convexity of the objective 
function. In this pap er, however, we exploit convexity to ob tain several simple majoriz ation-minizm ation 
(MM) algorithms (c.f. iHunter and Langd. bood: iHunterl l2000l) following methodology from|KierJ (|2002l) . We 
then compare all algorithms that minimize ([1} in terms of convergence and computational time. 



2 Minimization on the orthogonal Stiefel manifold 



2.1 Flury Method 

Flurv and Gautschi (1986) suggest an algorithm based on pairwise minimization of the matrix D. That is, 



each pair of columns or eigenvectors of D is updated while holding the others fixed. These updates are based 
on the eigendecompostion of 2 x 2 matrices summed across groups. Then we are required to loop through 
all pairs of columns of the matrix D to comple t e a sin gle iteration. This makes the Flury method ineffective 
in higher dimensions. See lFlurv and Gautschil ([19861 ) for details on the algorithm. 



2.2 Accelerated Line Search 

An accelerated line search algorithm (ALS) on a manifold consists of selecting a search direction in the 
tangent space and then moving this direction until a 'reasonable' decrease in the objective function is found. 
Browne and McNicholasI (|20 12T) introduces an ALS algorithm to minimize the func tion in equation ([1]). An 
extensive review of optimization on matrix manifolds is given by Absil et al. (l2008j). Th is methods requires 
tunning parameters and we use the values suggested by iBrowne and McNicholasI "i 2012 ) . 



2.3 MM Algorithm 1 

We can exp loit the convexity of the objective function ([1]) to obtain a MM algorithm similar that given in 
Kiersl ( 2002 ). Three different MM algorithms are presented and each algorithm has a surrogate function of 
the form 

G 

/(D) = Y, tr{W 5 DA^D'} < C + tr {F t D} 

9=1 



where C is a constant that does not depend on D, F t 



w fc A g 1 DQ, uig is largest 



eigenvalue of the matrix W 9 , and subscript t denot es iteration number. The largest eigenvalu e of a matrix 
can be determined using the power iteration method (|von Mises and Pollaczek-Geiringer , Il929h . If we obtain 
the singular value decomposition F t = P t B t RJ, in which P t and Ri are orthonormal, and B t is diagonal, 
containing the singular values of F t on the diagonal. Then the update of the matrix D becomes D t+ i = RfP' f 
Then we iteratively repeats this process until convergence. 



2.4 MM Algorithm 2 

In the second MM algorithm, F t = Y^, g =i (WjDjAj 1 — afeW g D f ) , where a g is largest eigenvalue of the 
matrix A" 1 . Because A g is diagonal and positive definite, the largest eigenvalue of A^ 1 is easily determined. 
The minimum of the surrogate is found using the same method as in Section 12.31 
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2.5 MM Algorithm 3 

In the third MM algorithm, F t = X^ 9 =i 0^ s^tA^ 1 — AgDj), where X g is largest eigenvalue of the matrix 
A" 1 (g) W s . Because the matrix A" 1 (g) W g is the Kronecker product of two matrices, we have \ g = a g uj g . 
The minimum of the surrogate is found using the same method as in Section 12.31 

2.6 MM Algorithm 4 for the EVE model 

We iterate over MM algorithm 1 and MM algorithm 2. 



3 Simulation Study 

We simulate various instances of the problem of minimizing ([D to compare o ur approach to the Flury 



method and the accelerated line search used in iBrowne and McNicholasI ((2012). We randomly generated 



Wi , . . • , Wq where each was produced from ap+1 observations from the p-dimensional standard normal 
distribution. In addition, we randomly generated the diagonal elements Ai, . . . , A.q from the half-normal 
distribution. Then we varied the number of dimensions p. We used the identity matrix as a starting value 
for each algorithm and then we ran until convergence. For each simulation, we recorded the system time, 
the number of iterations, and value of the objective function at the converged solution. 

Table [1] shows averages of the system times and the number of iterations from 100 simulations of the six 
algorithms. The table also gives the the relative difference between the minimum and the converged minimum 
('% Diff.') for each case. For a particular simulation, if {t\, . . . ,ts} are the values of the objective function 
from the converged solutions from the six algorithms and we let i m ; n = min{ti, . . . , ig}, then difference 
percentage for algorithm k is (£& — t m ; n ) / tmini for fc = 1, . . . 6. Note that if an algorithm has a large '% Diff.' 
then we could use a stricter convergence criteria to improve the result. However, a stricter convergence 
criteria will also increase the number of iterations and thus the system time. To facilitate comparison, we 
used the same convergence criteria for each algorithm. 



Table 1: The average system times (in seconds), iterations and difference between convergence value and 
minimum from the six algorithms. 







p = 5, G = 5 






p = 20, G = 5 




Method 


Time 


Iter. 


% Diff. 


Time 


Iter. 


% Diff. 


ALS 


0.038 


34 


0.050 


0.292 


83 


8.549 


Flury 


0.050 


10 


0.011 


4.016 


27 


0.016 


MM 1 


0.055 


60 


0.007 


0.323 


218 


0.362 


MM 2 


0.027 


85 


0.010 


0.235 


318 


1.045 


MM 3 


0.165 


179 


0.017 


0.872 


580 


2.526 


MM 4 


0.035 


32 


0.001 


0.263 


128 


0.180 



Table Q] illustrates that the Flury method becomes computationally infeasible when the dimension in- 
creases from five to twenty. Also, it seems the ALS method and MM 3 algorithm tend to converge prema- 
turely. Conversely, MM 4 seems to retain computational efficiency while maintaing the same convergence 
rate as the Flury algorithm. Table [2] tells the same story as Table [T] but with higher dimensions. Results for 
the Flury method are not reported in the right-hand column of Table [2] due to prohibitive computational 
time. In the left-hand column of Table [21 the Flury method converged to the smallest value in each simula- 
tion. The MM 4 gives similar results to the Flurry algorithm but is just as fast as the ALS algorithm. We 
note that the running parameters for ALS algorithm could be adjusted to optimize the performance of the 
ALS algorithm. 
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Table 2: The average system times (in seconds), iterations and difference between convergence value and 
minimum from the six algorithms. 







p = 50, G = 5 






p = 100, G = 5 




Method 


Time 


Iter. 


% Diff. 


Time 


Iter. 


% Diff. 


ALS 


2.85 


174 


0.347 


9.66 


78 


0.500 


Flury 


101.71 


33 


0.000 








MM 1 


2.22 


303 


0.025 


10.51 


290 


0.010 


MM 2 


2.76 


565 


0.080 


22.12 


836 


0.120 


MM 3 


8.62 


961 


0.187 


32.85 


1521 


0.355 


MM 4 


2.26 


214 


0.016 


12.78 


230 


0.000 



4 Discussion 



We fin d that a MM algorithm is just as fast as the ALS algorithm introduced by Browne and McNicholasi 



(l2012h but has the same properties as the Flury method (jFlurv and Gautschil . Il986l ) In addition, the MM 
algorithm does not have any tunning parameters unlike the ALS algorithm. This will allow the implementa- 
tion of techniq ues for higher dim e nsiona l problems for which the Flury method is too slow. Examples include 
the method of Bouvevron et al. ( 20071 ) fo r clustering high dimension al data and parameter estimation for 
some of the mixture models considered bv lCeleux and Govaert (1995). 
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